Individual identity information persists in learned calls of introduced parrot populations

Animals can actively encode different types of identity information in learned communication signals, such as group membership or individual identity. The social environments in which animals interact may favor different types of information, but whether identity information conveyed in learned signals is robust or responsive to social disruption over short evolutionary timescales is not well understood. We inferred the type of identity information that was most salient in vocal signals by combining computational tools, including supervised machine learning, with a conceptual framework of “hierarchical mapping”, or patterns of relative acoustic convergence across social scales. We used populations of a vocal learning species as a natural experiment to test whether the type of identity information emphasized in learned vocalizations changed in populations that experienced the social disruption of introduction into new parts of the world. We compared the social scales with the most salient identity information among native and introduced range monk parakeet (Myiopsitta monachus) calls recorded in Uruguay and the United States, respectively. We also evaluated whether the identity information emphasized in introduced range calls changed over time. To place our findings in an evolutionary context, we compared our results with another parrot species that exhibits well-established and distinctive regional vocal dialects that are consistent with signaling group identity. We found that both native and introduced range monk parakeet calls displayed the strongest convergence at the individual scale and minimal convergence within sites. We did not identify changes in the strength of acoustic convergence within sites over time in the introduced range calls. These results indicate that the individual identity information in learned vocalizations did not change over short evolutionary timescales in populations that experienced the social disruption of introduction. Our findings point to exciting new research directions about the robustness or responsiveness of communication systems over different evolutionary timescales.

The annotated code and knitted RMarkdown files supporting the accompanying article are available on GitHub: https://github.com/gsvidaurre/identity-information-post-introduction.

Summary statistics across social scales and ranges
We described contact call recording and pre-processing steps for both native and introduced range calls in previous work [1,2]. For this research, we used a dataset of contact calls across social scales and ranges that contained 1582 calls. The individual scale dataset contained calls from individuals repeatedly sampled over short timescales (typically a few minutes), while the site scale dataset contained a single call for each "unique" individual. Unless otherwise specified, throughout these materials we used "site" to refer to the nesting sites at which recording sessions were conducted. We sampled some introduced range sites in more than one year. While harmonizing site codes for introduced range sites recorded over time in Austin (based on spatial proximity), we merged 2 pairs of native range sites into a single site each (PFER-01 and PFER-03 were merged into PFER, while RIAC-01 and RIAC-02 were merged into RIAC). These recording sites had been kept separate in previous work to assess fine-scale patterns of geographic variation [1]. This merging yielded 2 fewer sites in the call dataset for native range site scale compared to previous work [1]. See the knitted RMarkdown reports and scripts provided in the GitHub repository associated with this paper, as well as Tables A3 -A5 in [2] for more information about site names, localities, geographic coordinates, and the number of calls sampled per individual and site.
The variation in geographic distances among the sites that we used to compare hierarchical mapping patterns between ranges are reported in the main text. For our temporal analyses in the introduced range, we compared calls recorded in 2004 among 3 sites recorded in Texas (5.59 -15.79 km apart), as well as 3 sites in Louisiana (4.09 -10.43 km apart). We also compared calls from 5 sites recorded in Texas in 2011 (distances of 0.75 -7.03 km), and then compared calls from 2 sites in Louisiana in 2011 that were 1.19 km apart.
For our third temporal sampling comparison in 2019, we used 6 sites in Texas that were 1.56 -7.95 km apart.

Overview of repeated sampling of unmarked individuals
We obtained the majority of contact calls in this study from unmarked individuals. At the site scale, we obtained a single contact call per unmarked bird. Therefore, some calls in the site scale dataset may have represented repeated sampling of one or more unmarked individuals, which could influence resulting patterns of acoustic variation at the site scale in each range.
We addressed this issue by using three data pre-processing approaches that accounted for repeated sampling of individuals and yielded different versions of the site scale dataset. First, we used the full site scale dataset per range, which represented no filtering for potential repeated sampling of individuals. Next, we identified site scale calls that were most likely to represent repeated sampling of individuals per site-year in each range, and used two approaches (clustering and visual classification) to identify calls from repeated sampling of individuals. Throughout the following sections, the term "site-year" was used to account for the fact that some introduced range sites were recorded in different years, and calls for such sites were not combined across years during the analyses discussed below.

Accounting for repeated sampling of unmarked birds at the site scale
Although we considered each call in our site scale dataset representative of a "unique" individual, some calls may have instead represented repeated sampling of the same unmarked individuals within sites. After preliminary visual inspection, we suspected that repeated individual sampling was higher in the introduced range. We reasoned that keeping or removing calls that represented repeated individual sampling could change the patterns of call convergence that we identified at the site scale. To address this concern, we used three approaches to yield different versions of the site scale dataset that allowed us to determine how repeated sampling of individuals may have influenced the patterns of acoustic convergence that we reported at the site scale.
In our first approach, we used all calls recorded per site (e.g. the full site scale dataset), which addressed the possibility that removing similar calls could lead us to underestimate acoustic convergence at the site scale. Since including calls of repeatedly sampling individuals could lead us to instead overestimate acoustic convergence at the site scale, we used two distinct approaches to identify calls that were likely to represent repeatedly sampled individuals (S1 Appendix section 4). We used Gaussian mixture modeling on the SPCC similarity matrix to cluster calls, and we considered each cluster a repeatedly sampled individual. We also presented the clustered calls to six observers in a visual classification interface we designed in RShiny, a platform for dynamic web applications [3].
Observers visually inspected and modified the assignment of calls to clusters (e.g. potential repeated individuals) in the Shiny app. We used results from each of our clustering and visual classification approaches, respectively, to filter out calls attributed to repeatedly sampled individuals, which yielded 2 smaller versions of the site scale dataset. To filter calls, we randomly sampled one call to keep for each repeated individual that was identified per site by either clustering or visual classification, and we also kept all calls per site that were not attributed to repeated individuals.
Our 3 approaches to address repeated sampling of unmarked individuals within sites yielded 3 versions of the site scale dataset: (a) the full dataset, and datasets filtered to remove the calls that we attributed to repeatedly sampled individuals by either our (b) clustering or our (c) visual classification approaches. We used these 3 site scale datasets in subsequent analyses of hierarchical mapping patterns to assess the robustness of our results to the inclusion or the removal of calls of unmarked individuals that may have been repeatedly sampled as "unique" birds. We also assessed the degree of repeated individual sampling that occurred in our call datasets in each range by assessing how many potential repeated individuals were identified and how many calls were attributed to repeatedly sampled individuals by each of our clustering and visual classification approaches.

Preparing to identify calls of repeatedly sampled individuals
We used the individual scale dataset (e.g. known repeatedly sampled individuals) to identify calls that were most likely to represent repeated individual sampling at the site scale. We used the degree of overlap between these distributions to identify a threshold that split SPCC values into each distribution (e.g. within individual versus among individual comparisons) better than expected by chance. This SPCC value was representative of a split between pairwise comparisons within and among known repeatedly sampled individuals, and could therefore be used to identify pairwise comparisons more likely to represent repeated sampling of individuals in the site scale dataset. Each SPCC distribution was binned by iterating over different numbers of total bins, as the number of total bins could influence the degree of overlap between distributions. The first bin that yielded less than 50% overlap (less than chance) between distributions was identified. We used the minimum SPCC value of this bin as a similarity threshold (0.4384) to identify pairwise comparisons greater than or equal to this threshold in the site scale dataset. In other words, we used pairwise call comparisons that met this condition in the site scale dataset to identify calls that were likely to represent repeated sampling of unmarked individuals.

Identifying calls of repeatedly sampled individuals
In our first approach to identify repeated sampling of individuals at the site scale, we used Gaussian mixture modeling as a clustering method on SPCC values per site-year and range.
We performed clustering with the package mclust version 5.4.1 [4]. Each resulting cluster per site-year was considered a potential repeated individual. Here, we did not set the number of mixture components (e.g. the number of clusters) but rather allowed models to identify the "best" number of clusters, as we did not have a priori expectations for the number of potential repeated individuals present per site-year.
In our second approach, we designed a dynamic visual classification interface with RShiny, a platform for dynamic web applications [3]. In this custom-designed visual classification app, we used the number of clusters identified by Gaussian mixture models per site-year as classes to present spectrograms to users. Spectrograms of calls were presented on a separate page per site-year, and were grouped by cluster on each page. The purpose of this visual classification approach was to provide finer-scale control over how repeated individuals were identified compared to Gaussian mixture models. From preliminary visual inspection of the calls clustered together by Gaussian mixture models, we found that these models often lumped together visibly distinctive calls in the same cluster, and also assigned visibly similar calls across different clusters. For both of these approaches to identify calls of repeatedly sampled individuals (clustering alone, or clustering followed by visual classification), we used an earlier version of the site scale dataset with 1596 calls that included 14 calls from known repeatedly sampled individuals (e.g. 1 call each for 14 native and introduced range individual recorded at sites included in the site scale dataset, such that 14 calls were duplicated across the individual and site scale datasets). We dropped these 14 calls for our main analyses so we could perform multidimensional scaling on all calls across social scales for call similarity measurements.
In our visual classification approach, six observers (including the 5 co-authors) were prompted to assess visible patterns of structural similarities and differences among calls.
Observers could decide whether to 1) leave class assignments unchanged, 2) move spectrogram images among classes if they disagreed with how calls had been grouped, or 3) create new classes that represented additional repeated individuals. Observers could also indicate which calls they did not think represented repeated sampling of individuals. Call metadata, including geographic information, was hidden from observers and each observer had to pass a training module (1 native and 2 introduced site-years with known repeatedly sampled individuals) in which they had to identify known repeatedly sampled individuals with 75% classification accuracy, slightly higher than the individual scale classification accuracy reported in previous work [1]. This multi-observer approach split the visual classification task across co-authors. Native and introduced site-years were randomly split as evenly as possible across observers (12 -16 site-years assessed per observer).
We settled on this multi-observer approach to divide the otherwise time-intensive workload of visually classifying calls for 63 site-years. In addition, native range and 2004 introduced range contact calls had each already been pre-processed by a different observer to address potential repeated sampling of unmarked individuals [1,5]. We felt that a multiobserver approach was important to minimize bias that could arise from a single (and different) observer performing visual assessment of potential repeated individuals per range.
In our multi-observer visual classification approach, four sites were evaluated by each observer to facilitate analysis of multi-observer reliability. These four sites were chosen to represent ranges, numbers of calls per site-year, and temporal differences in the introduced range as evenly as possible. However, since native range sites typically had fewer calls sampled than introduced range sites, and only introduced range sites were sampled over time, we used 1 native and 1 introduced range site with 25 calls each, and 1 introduced range site sampled in 2 years with > 50 calls for multi-observer reliability analyses.

Multi-observer reliability of visual classification
Observers could redo the training module until they achieved high overall training accuracy.
While checking the training data, we realized that the training data for one observer was only partially saved. However, the mean ± standard error (SE) training accuracy for the final training run across all 6 observers (78.45% ± 3.73%) was similar to training accuracy for the remaining 5 observers: 78.33% ± 3.97%. We then parsed each observer's prediction data (e.g. all site-years assessed following training) for validation analyses. First, we carried out multi-observer reliability analyses in which we evaluated how many potential repeated individuals were identified at the 4 sites classified by all observers. Second, we calculated the mean ± SE of the number of potential repeated individuals identified per site-year per range, and the number of calls attributed to these individuals by both methods used to assess repeated individual sampling (clustering via Gaussian mixture models, and visual classification). These statistics allowed us to assess the degree of repeated individual sampling identified per range, and we calculated these statistics after removing the 4 sites used for multi-observer reliability. We identified high variability in how many potential repeated individuals were identified per site-year across observers. Fleiss' kappa, a measurement of multi-observer reliability calculated with the package irr version 0.84.1 [6], was both negative and very small (-0.035). This low kappa value reflected low agreement among observers, and overall, the difficulty of identifying potential repeated individuals per site without a priori information. See the knitted RMarkdown reports in the GitHub repository associated with this paper.

Generating site scale datasets for later analyses
We used our analysis of repeated individual sampling at the site scale to generate 3 versions of the site scale dataset. We used the full set of calls as the first site scale dataset (this contained all calls per site for 1177 calls total). To obtain our second site scale dataset, we filtered the full set of site scale calls by potential repeated individuals identified in our clustering approach with Gaussian mixture models. To do so, we randomly selected one call per potential individual (i.e. cluster) identified per site. We retained all calls for potential individuals for which only a single call had been assigned, as well as calls below the SPCC threshold that were not used in the clustering analyses (515 total calls). To yield the third dataset of the site scale calls, we randomly selected one call per potential individual identified per site in the visual classification data obtained across observers. We also retained calls that had been attributed to "unique" individuals by observers, as well as calls for potential repeated individuals that had been assigned only a single call, and calls below the SPCC threshold (e.g. calls not used in the visual classification app) per site (618 calls total). For the 4 sites assessed by all observers (S1 Appendix sections 5 and 6), we randomly selected data from one observer per site to retain in the final site scale visual classification dataset. Finally, we identified 4 calls for one repeatedly sampled individual in the native range that had been inadvertently retained for one site at the site scale in previous work [1]. Neither the retention nor removal of these 4 calls affected the Mantel test results reported in [1], but we nevertheless dropped these calls from the site scale dataset (see the associated code on GitHub).

Obtaining acoustic and image measurements for supervised machine learning
We used supervised random forests models and spectrographic cross-correlation (SPCC) as complementary similarity measurements. Here we describe our random forests modelling approach in more detail. We obtained a large number of acoustic and image measurements to assemble a set of predictors while building supervised random forests models. These acoustic measurements included call similarity measurements as well as several other types of acoustic measurements.
To obtain different call similarity measurements, we measured SPCC on spectrograms and Mel-frequency cepstral coefficients (MFCCs, which reflect how energy is allocated across different frequency bands), dynamic time warping (DTW) on spectral entropy and dominant frequency time series (each estimated at 100 timepoints per call), and multivariate DTW on spectral entropy and dominant frequencies with the warbleR and dtw packages [7,8]. We performed SPCC on spectrograms and MFCCs using a Hanning window, window length of 378, window overlap of 90, a bandpass filter of 0.5 to 9kHz, and 40 warped spectral bands for MFCC cross-correlation. We performed cross-correlations with warbleR version 1.1.27 in R version 4.1.2 [7,9]. Unless otherwise specified, we used the same settings and package versions for the majority of measurements used as predictors for supervised machine learning.
We also obtained a standard set of 27 spectral acoustic measurements (including duration and frequency measurements, as well as kurtosis and skew [7]), 88 descriptive MFCC statistics (performed with 12 total cepstral coefficients and 40 warped spectral bands), and over 2000 spectrogram image measurements obtained from the software WNDCHRM version 1.6 [1,10]. The MFCC descriptive statistics included the mean, median, minimum, maximum, kurtosis, skewness, and variance of each cepstral coefficient, as well as the mean and variance of the first and second derivatives. The spectrogram image measurements we used are described in more detail in the accompanying code and in [10]. To generate spectrograms for image measurements, we used the warbleR package [7] and the Fourier transformation parameters described above.

Deriving features for supervised machine learning
We derived acoustic features from the acoustic similarity matrices and the other datasets of acoustic measurements. We used these features as predictors for random forests models, which allowed us to reduce collinearity among the original measurements, and also allowed us to convert similarity matrices to a lower-dimensional, tabular format compatible with random forests models.
To derive acoustic features from acoustic similarity matrices, we applied non-metric multidimensional scaling (MDS) to each similarity matrix to yield low-dimensional solutions with 30 dimensions total. We implemented MDS with the package MASS version 7.3-51.6 [11] over 1000 maximum iterations. We derived acoustic features from 5 similarity matrices: 1) SPCC with spectrograms, 2) SPCC with MFCC, 3) dynamic time warping (DTW) on spectral entropy timeseries, 4) DTW with dominant frequency time series, and 5) multivariate DTW with spectral entropy and dominant frequencies. Each matrix represented the full dataset of calls across the individual and site scales in each of the native and introduced ranges.
Next, we obtained features from acoustic measurements by applying principal components analysis (PCA) to each set of measurements. We pre-processed measurements prior to running PCA by using the caret package version 6.0-80 [12] to perform Yeo-Johnson transformations, centering and scaling, and to remove variables with near zero variance. PCA was performed with the base package stats in R version 3.4.4 [9]. We derived acoustic features using PCA from the following 3 acoustic measurement datasets: 1) standard spectral acoustic measurements, 2) MFCC descriptive statistics, and 3) spectrogram image measurements. MDS and PCA features were generated for the full dataset of calls across the individual and site scales for both native and introduced range monk parakeet contact calls.
We checked collinearity among all of these features with a Pearson's correlation analysis. We also checked for high correlations between features and signal to noise ratio, but found no relationships with Pearson's r > 0.75. We dropped features with Pearson's | r | greater than 0.75 in the collinearity analysis, yielding 1844 predictors for machine learning.

Splitting calls into datasets for supervised machine learning
As in previous work, we used individual scale calls for random forests model training and validation in a classification approach, in which class labels were individual identities [1]. We

Model training, validation, and prediction
We trained and tuned a random forests model with the full set of 1844 predictors over 5 iterations of repeated 5-fold cross-validation using caret version 6.0-86 [12] and ranger version 0.12.1 [13]. We performed supervised machine learning analyses in R version 3.6.3, and again in R 4.1.2 (results in the main text are reported from this later run) [9]. The mtry parameter, or the number of predictors randomly selected at each split, was tuned to 186 for the first model built with 2000 trees. We did not vary the number of total trees, as training accuracy showed little change in response to the size of the forest in previous work [ Random forests performance can sometimes be improved by retaining only variables of relatively high importance. We used an automated feature selection algorithm from the Boruta package version 7.0.0 [14] to identify the most important features from the first random forests model. We then used these 114 features to build a second random forests model that was trained and tuned in the same way as the first. The second model achieved 97.44% which identified an optimal number of 8 clusters, similar to the 9 total individuals used between ranges. In addition, clustering accuracy for the native range was the same as reported in previous work, with only a single call misclassified for one individual (see RMarkdown files provided). We designed this validation approach to validate the way in which we used random forests for prediction of call similarity with the full site scale dataset, in that we ignored call classifications produced by the model and extracted the resulting proximity matrix as the random forests similarity matrix [1, [15][16][17][18]. See the code provided on GitHub for more information.

Patterns of acoustic variation over social scales and ranges
We compared patterns of acoustic variation per social scale and range in two-dimensional space by using non-metric MDS to reduce SPCC and random forests similarity matrices per social scale. MDS was performed for all calls per social scale and per similarity method with the package MASS version 7.3-51.6 using 1000 maximum iterations [11]. We optimized MDS to yield stress values close to 0.05, which we used as a general rule-of-thumb to identify optimal MDS solutions with low stress [19].   [2] for full site names and more information about the localities where calls were recorded for these individuals and sites. In Fig 2A, the native range individual used for the individual scale call lexicon was NAT-UM5, while the introduced range individual was INT-UM16. In Fig 3A, the site used for the native range site scale call lexicon was LENA, and the introduced range site was INTR.

Using Earth Mover's Distance to quantify acoustic convergence
Earth Mover's Distance represents the amount of work that it takes to convert one distribution into another [20]. for developing this code [22]. Since we did not have a priori expectations for an optimal number of bins to use, we iterated over six total bin numbers (2, 4, 8, 16, 32, 64) to calculate Earth Mover's Distance. These six bins represented six powers of 2, or a sequence of numbers that represented coarser to finer bin resolution. We calculated Earth Mover's Distance across these total bin numbers using Euclidean distance and 5000 maximum iterations.
We designed a resampling routine to generate Earth Mover's Distance calculations that could be compared across social scales and ranges. In this customized routine with 100 resampling iterations, we randomly sampled 5 exemplars of each category (e.g. individuals or sites) from each of the native and introduced ranges, and also randomly sampled 5 calls for each of these randomly sampled individuals or sites that had 5 or more calls. In these comparisons of acoustic convergence between ranges and over time in the introduced range, we did not have a priori expectations for an "optimal" number of bins to use for the Earth Mover's Distance calculations. Therefore, we assessed whether these calculations were consistent across total bin numbers by calculating means and 95% confidence intervals (CIs) for each total bin number (e.g. by summarizing across resampling iterations for each number of total bins). Earth Mover's Distance values were consistent across these bin numbers for the individual and site scale in each of the native and introduced ranges ( Figures S3 and S4). We show plots of the raw data generated by our resampling routine (e.g. before calculating means and 95% Cis) in the RMarkdown output files that are available on GitHub. We performed the same quality control checks for our analyses in the introduced range over time, and confirmed that Earth Mover's Distance calculations were also consistent across bins in for the two introduced range cities that we sampled over time

Population trends in the introduced range
We addressed the possibility of significant population growth in the introduced range by assessing population trends in two cities where we recorded parakeets over time. We used eBird checklists to assess population trends in Austin, Texas, U.S. and New Orleans, counties, and we did not include unvetted data. The package auk version 0.4.1 [24] was used to read in eBird data in R version 3.6.3 [9], and we filtered the county scale eBird data to obtain sightings for each city.
We averaged the frequency of monk parakeet sightings in complete checklists relative to all other species over weeks and years to yield an annual percentage of monk parakeet sightings per city. We used this annual percentage of monk parakeet sightings to evaluate population trends over the span of our sampling years [25]. Although we did not directly evaluate changes in local population size over time, we considered the frequency of monk parakeet sightings over time to be an indicator of relative changes in population size over time, as larger populations of monk parakeets should yield more frequent sightings than smaller populations. We calculated the frequencies of monk parakeet sightings relative to all other species reported in complete eBird checklists, to correct for the general trend of increased eBird reporting over time [25]. We found that the annual percentage of monk parakeet sightings, which represented the frequency of monk parakeet sightings relative to other avian species, remained low in Austin and New Orleans throughout our sampling years (S7 Fig). The mean annual percentage of monk parakeet sightings never rose above 5% relative to all species observed in complete checklists between 2004 and 2019 for either city.
See the code and knitted RMarkdown reports provided on GitHub for more information.

Choosing individuals and sites for Mantel tests
We used Mantel tests to assess hierarchical mapping patterns over social scales

Using Mantel tests to compare hierarchical mapping patterns between the native and introduced ranges
After evaluating hierarchical mapping patterns in acoustic space, we used Mantel tests to assess acoustic convergence at each social scale for both of the native and introduced ranges. We performed Mantel tests with matrices of call similarity (SPCC for the individual scale; random forests and SPCC for the site scale) and matrices of binary identity per social scale over 9999 permutations. We performed Mantel tests with 9999 permutations over social scales and ranges using the package vegan version 2.5-6 [26]. These tests allowed us to ask whether calls were more similar within an individual or site compared to among individuals or sites for both the native and introduced ranges.
We conducted Mantel tests with binary matrices of either individual or site identity. We generated these binary matrices by assigning calls compared within the same individual or site the value of 0, and a value of 1 to calls compared among individuals or sites. Call similarity and binary identity similarity matrices were converted to distances by subtracting matrices from 1. We performed a partial Mantel test for the introduced range individual scale using a third matrix of binary site identity, as these individuals were sampled over more than  Appendix). We also found statistically significant convergence within sites in Austin in 2019 and New Orleans in 2004 using the full dataset of calls and both similarity methods (Table E in S1 Appendix).

Effects of repeated individual sampling and geographic isolation
While we identified stronger individual signatures than convergence within sites in each range, we also identified generally higher convergence at the site scale for the introduced range compared to the native range (Table B in S1 Appendix). If this difference in the magnitude of site scale convergence between ranges reflected a change in identity information encoding, then we should have consistently identified the strongest acoustic convergence at this social scale in introduced range calls. Instead, we found that the magnitude of site scale acoustic convergence for the introduced range was sensitive to the call dataset that we used (Table B in S1 Appendix). In addition, the strength of convergence that we identified at the site scale for the introduced range was consistently lower than individual scale convergence in each of the native and introduced ranges ( Table B in S1 Appendix). Therefore, the higher acoustic convergence that we reported at the site scale for the introduced range may reflect repeated sampling of unmarked individuals within sites, which is further supported by the greater degree of repeated individual sampling in our introduced range site scale dataset (Table 1).
However, the higher convergence that we identified for the introduced range site scale may also reflect the geographic isolation of introduced range populations. Introduced range monk parakeet populations are largely confined to city boundaries in the U.S. [27] and are more geographically disjunct than native range populations that exhibit greater contiguity over large geographic areas (Fig 1; Smith

Preparing for comparative analyses
We used previously published calls of yellow-naped amazons (Amazona auropalliata), a parrot species that exhibits convergence on calls within sites and regional populations [29]  The mean and overall distribution of similarity values for yellow-naped amazons matched the range of SPCC values originally reported in [29]. For additional validation, we also performed MDS with 15 total dimensions on the SPCC matrix for yellow-naped amazon contact calls and generated a plot of low-dimensional acoustic space to evaluate whether we could reproduce results reported in the original 1996 study (panel D, S8 Fig). After confirming that we had reproduced the quantifiable acoustic differences among regional vocal dialects originally reported for yellow-naped amazons in 1996, we moved on to our comparative analysis.

Selecting SPCC values for comparative analyses
In this analysis, our goal was to quantify and directly compare hierarchical mapping patterns among native range monk parakeets, introduced range monk parakeets, and yellow-naped amazons. We performed SPCC for each species after matching sampling rates of audio files per species (S1 Appendix section 19). For yellow-naped amazons, we used previously published contact calls recorded in Costa Rica in 1994 [29]. We selected SPCC values for a subsample of individuals or groups at each social scale that represented similar sampling depth and geographic breadth for each range and species. We used these subsamples of individuals or groups to compare SPCC values within or among these categories at each social scale.
At the individual scale, we used 5 repeatedly sampled native range monk parakeets recorded at the same site in the department of Colonia, Uruguay in 2017. We chose 5 repeatedly sampled introduced range monk parakeets sampled at 3 sites close to each other in the city of Austin, TX, U.S. in 2019. We decided to use these introduced range individuals because parakeets recorded at more distant sites were more likely to overlap more in acoustic space [2]. For yellow-naped amazons, we chose 12 repeatedly sampled individuals from 3 randomly sampled North dialect sites in Costa Rica.
For the site scale, we selected 6 randomly sampled sites for native range monk parakeets that we had recorded in the department of Colonia, Uruguay in 2017. We chose 6 sites for introduced range monk parakeets that we had recorded in the city of Austin, U.S. in 2019. We chose these particular sites for native and introduced range monk parakeets to obtain similarity values representing call comparisons over a geographic breadth similar to yellow-naped amazons. For yellow-naped amazons, we used 7 sites recorded in the North dialect geographic region of Costa Rica, which spanned less than 40km in 1994 [29]). Two to four birds were usually repeatedly sampled per site in the original dataset of [29].
We also quantified acoustic convergence at another social scale, the regional dialect scale, for yellow-naped amazons only. We selected calls recorded from the North and South regional dialects in Costa Rica [29] for this analysis. Native range and U.S. introduced range monk parakeets did not exhibit visually distinctive regional vocal dialects in the contact calls that we sampled [1,2,5].
We then selected SPCC values for the subsampled individuals and groups from the full SPCC matrix for native range monk parakeets, introduced range monk parakeets, and yellownaped amazons. We obtained SPCC values that represented pairwise comparisons among calls within or among these different categories (e.g. within the same individual or group as well as among different individuals or groups). We plotted density curves of these SPCC values within the same category or among different categories across social scales for each species and range (Fig 6A).
In these plots, we used greater separation between the distributions of similarity values

Bootstrapping SPCC values for comparative benchmarking analysis
The main bootstrapping analysis we performed is described in the main text. For native range and introduced range monk parakeets, we also performed this bootstrapping analysis at the site scale across the 3 datasets that we used to account for repeated sampling of unmarked individuals. We found generally similar results at the site scale for native and introduced range monk parakeets, and we reported main results for monk parakeets from the full site scale dataset (Fig 6B). See the code provided on GitHub for bootstrapping analyses and results across monk parakeet site scale datasets.